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Abstract: This paper presents a practical and simple fully nonparametric multivariate smooth- 
ing procedure that adapts to the underlying smoothness of the true regression function. Our 
estimator is easily computed by successive application of existing base smoothers (without 
the need of selecting an optimal smoothing parameter), such as thin-plate spline or kernel 
smoothers. The resulting smoother has better out of sample predictive capabilities than the 
underlying base smoother, or competing structurally constrained models (GAM) for small 
dimension (3 < d < 7) and moderate sample size n < 800. Moreover our estimator is still 
useful when d > 10 and to our knowledge, no other adaptive fully nonparametric regression 
estimator is available without constrained assumption such as additivity for example. On a 
real example, the Boston Housing Data, our method reduces the out of sample prediction 
error by 20%. An R package ibr, available at CRAN, implements the proposed multivariate 
nonparametric method in R. 
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1. Introduction 

Regression is a fundamental data analysis tool for uncovering functional relationships between 
pairs of observations {Xi,Yi),i = 1, . . . ,n. The traditional approach specifies a parametric family 
of regression functions to describe the conditional expectation of the response variable Y given the 
independent multivariate variables X G K.'', and estimates the free parameters by minimizing the 
squared error between the predicted values and the data. An alternative approach is to assume that 
the regression function varies smoothly in the independent variable x and then estimate locally the 
conditional expectation m(x) = K[Y\X ~ x]. This results in nonparametric regression estimators. 
We refer the interested reader to Eubank [1999], Fan and Gijbels [1996] and Simonoff [1996] for a 
more in depth treatment of various classical regression smoothers. The vector of predicted values Yi 
at the observed covariates Xi from a nonparametric regression is called a regression smoother, or 
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simply a smoother, because the predicted values Yi are less variable than the original observations 
Yi. Operationally, linear smoothers can be written as 

m = SY, 

where 5" is a 71 x n smoothing matrix. Smoothing matrices S (or S\) typically depend on a tuning 
parameter, which we denote by A, that governs the tradeoff between the smoothness of the estimate 
and the goodness-of-fit of the smoother to the data, by controlling the effective size of the local 
neighborhood of the explanatory variable over which the responses are averaged. We parameterize 
the smoothing matrix such that large values of A will produce very smooth curves while small A will 
produce a more wiggly curve that wants to interpolate the data. For example, the tuning parameter 
A is the bandwidth for kernel smoother, the span size for running-mean smoother, and the scalar 
that governs the smoothness penalty term for thin plate splines (TPS). 

It is well known that given n uniformly distributed points in the unit cube [—1, 1]'', the expected 
number of points that are covered by a ball centered at the origin with radius e < 1, scales as 
ne'^. This is to say that covariates in high dimensions are typically sparse. This phenomenon is 
sometimes called the curse of dimensionality. As a consequence, nonparametric smoothers must av- 
erage over larger neighborhoods, which in turn produces more heavily biased smoothers. Optimally 
selecting the smoothing parameter does not alleviate this problem. Indeed, when the regression 
function m mapping R"* to M belongs to some finite smoothness functional classes (Holder, Sobolcv, 
Bcsov) the optimal mean squared error rate of convergence is 7j"2iy/(2iy+d) .y^rjjgj-g jy jg ^jjg smoothing 
index [see for example Tsybakov, 2009]. Common wisdom suggest avoiding general nonparamet- 
ric smoothing in moderate dimensions (say d > 5) and focus instead on fitting structurally con- 
strained regression models, such as additive [Hastie and Tibshirani, 1995, Linton and Nielsen, 1995, 
Hengartner and Sperlich, 2005] and projection pursuit models [Friedman and Stuetzle, 1981]. The 
popularity of additive models stems in part from the interpretability of the individual estimated 
additive components, and from the fact that the estimated regression function converges to the 
best additive approximation of the true regression function at the optimal univariate mean squared 
error rate of n~^'^^^^'^'^^\ While additive models do not estimate the true underlying regression 
function, one hopes for the approximation error to be small enough so that for moderate sample 
sizes, the prediction mean square error of the additive model is less than the prediction error of a 
fully nonparametric regression model. 

The impact of the curse of dimensionality is lessened for very smooth regression functions. For 
regression functions with v = 2d continuous derivatives, the optimal rate is n~^/^, a value recog- 
nized as the optimal mean squared error of estimates for twice differentiable univariate regression 
functions. The difficulty is that in practice, the smoothness of the regression function is typically 
unknown. Nevertheless, there are large potential gains (in terms of rates of convergence) if one 
considers multivariate smoothers that adapt to the smoothness of the regression function. Since the 
pioneer work of Lepski [1991], adaptive nonparametric estimation became a major topic in math- 
ematical statistics (see for example Gyorfi et al. 2002 or Tsybakov 2009). Adaptive nonparametric 
estimator can be achieve either by direct estimation (see Lepski's method and related papers) or 
by aggregation of different procedures [see Yang, 2000] . This paper presents a practical and simple 
nonparametric multivariate smoothing procedure that adapts to the underlying smoothness of the 
true regression function. Our estimator is easily computed by successive application of existing 
smoothers, such as Thin Plate Spline or kernel smoother. Thanks to adaptivity (proven for TPS 
smoother), our estimator behaves nicely in small dimension (3 < 0? < 7) with moderate sample size 
n < 800 and remains useful when d > 10. 
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Section 2 introduces our procedure and motivates it as repeated corrections to the bias of a 
smoother, where at each step, the bias is estimated by smoothing the residuals. We use the Gener- 
ahzed Cross- Vahdation (GCV) criteria to stop our iterative procedure when the prediction error of 
our estimate is nearly minimized. The idea of estimating the bias from residuals to correct a pilot 
estimator of a regression function goes back to the concept of twicing introduced by Tukey [1977] 
to estimate bias of misspecified multivariate regression models. Numerous authors have shown the 
benefits of various bias reduction techniques in nonparametric regression, including He and Huang 
[2009], Choi et al. [2000], Choi and Hall [1998], Hcngartner et al. [2010], Hirukawa [2010]. 

The idea of iterative debiasing regression smoothers is already present in Breiman [1999] in the 
context of the bagging algorithm. More recently, the interpretation of the Z/2-boosting algorithm as 
an iterative bias correction scheme was alluded to in Ridgeway [2000] 's discussion of Friedman et al. 
[2000] paper on the statistical interpretation of boosting. Biihlmann and Yu [2003] present the 
statistical properties of the Z/2-boosted univariate smoothing splines and proposed an additive 
procedure to deal with multivariate data. Di Marzio and Taylor [2008] describes the behavior of 
univariate kernel smoothers after a single bias-correction iteration. 

Section 3 applies the iterative bias reduction procedure to multivariate Thin Plate Spline smoothers. 
TPS smoothers have attractive theoretical properties that facilitate our proofs of the adaptation 
to the unknown smoothness of our procedure. However, implementation of the TPS is limited by 
the need of the sample size to be larger than size of its parametric component. The latter grows 
exponentially with the dimension of the covariates d. For practical considerations, we consider, in 
Section 4, the iterative bias reduction procedure using kernel smoothers that can be applied more 
generally than TPS. We discuss the use of different kernels since the choice of the kernel is important 
for the iterative bias reduction procedure. 

The simulation results presented in Section 5 show that for moderate dimensions of the covariates 
(eg. 3 < d < 7), and sample sizes ranging from n = 50 to n = 800, our iterated smoother has 
significantly smaller prediction error than the base smoother with using an "optimal smoothing" 
parameter. We end this section with the prediction of the classical Boston housing data set {n = 506 
and d = 13). The interested reader can download an R implementation of our procedure with 
optimized computations for moderate sample size [Cornillon et al.. 2010]. 

Finally, the proofs are gathered in the Appendix. 

2. Iterative bias reduction 

This section presents the general iterative bias reduction framework for linear regression smoothers 
and shows that the resulting smoother, when combined with GCV, adapts to the underlying smooth- 
ness of the regression function. The advantage of our smoother is its simplicity: we only need to 
repeatedly smooth residuals using existing multivariate smoothers. The cost of adaptation is an 
increase in computational complexity. 

2. 1 . Preliminaries 

Suppose that the pairs [Xi^Yi) G K'' x M are related through the regression model 

Yi = m{X,)+ei, i = l,...,n, (1) 

where m(-) is an unknown smooth function, and the disturbances are independent mean zero and 
variance ct^ random variables that are independent of all the covariates (^i, . . . ,^n)- It is helpful 
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to rewrite Equation (1) in vector form by setting Y = (Yi, . . . , Yn)' , m — {m{Xi), . . . , m(X„))' and 
£ = (ei, . . . to get 

Y = m + e. (2) 

Linear smoothers can be written as 

m = SxY, (3) 

where S'a is an n x n smoothing matrix and fh = (Yi, . . . , Yn)' , denotes the vector of fitted values. 
From now on, we denote the smoothing matrix by S. Let / be the nxn identity matrix. The bias of 
the hnear smoother (3), conditionally on the observed values of the covariates X" = (Xi, . . . , X„), 
is 

E[m\X^]-m = {S - I)m = -E[{I - S)Y\X^]. (4) 
2.2. Bias reduction of linear smoothers 

Expression (4) for the bias suggests that it can be estimated by smoothing the negative residuals 
—R = —{Y — fh) = —{I — S)Y. An alternative approach is to estimate the bias by plugging in 
an estimator for the regression function m into the expression (4). The resulting estimators are 
different except if we consider using the same smoother for estimating the bias and for estimating 
the initial smoother. From now on, we consider using the same smoother. The initial estimator is 
given by 

mi = SY SiY. 

Smoothing the residuals 

bi ~ -SRi = -S{I~Si)Y 

estimates the bias. Correcting the initial smoother fhi by subtracting bi yields a bias corrected 
smoother 

7712 = SiY ~ bi 

= SiY + S{I - Si)Y := S2Y. 

Since 1712 is itself a linear smoother, it is possible to correct its bias as well. Repeating the bias 
reduction step fc — 1 times produces the linear smoother at iteration k: 

fhk - Sk-iY + S{I~Sk-i)Y:^Sk-iY-bk-i:^SkY 

The resulting fc*'' iterated bias corrected smoother becomes 

mfe = [/-(/ - S)'']Y := SkY. (5) 

In the univariate case, smoothers of the form (5) arise from the L2-boosting algorithm when setting 
the convergence factor fi^ of that algorithm to one. Thus we can interpret the L2-boosting algorithm 
as an iterative bias reduction procedure. From that interpretation, it follows that the i2-boosting of 
projection smoothers, as is the case for polynomial regression, bin smoothers and regression splines, 
is ineffective since the estimated bias 

b= S{I - S)Y = 0. 
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2.3. Predictive smoothers 

Our smoothers predict the conditional expectation of responses only at the design points. It is useful 
to extend regression smoothers to enable predictions at arbitrary locations x G K'^ of the covariates. 
Such an extension allows us to assess and compare the quality of various smoothers by how well 
the smoother predicts new observations. To this end, write the prediction of the linear smoother S 
at an arbitrary location x as 

■m{x) = SixYY, 

where S{x) is a vector of size n whose entries are the weights for predicting m(x). The vector S{x) 
is readily computed for many of the smoothers used in practice. Next, writing the iterative bias 
corrected smoother fhk as 

frik = mi - 61 H 

= S[I + (I-S) + {I -Sf + ■■■ + (!- S'f'^jY 

it follows that we can predict m{x) by 

irikix) = SixYfik- (6) 

2.4. Properties of iterative bias corrected smoothers 

The squared bias and variance of the /c*'' iterated bias corrected smoother ihk (5) are 

iE[mk\X^] - mf = m' {{I ~ S)'')' (I - S)''m 

var(mfc|Xn = a^I - {I - S)") {{I - {I - S)''))' , 

This shows that the qualitative behavior of the sequence of iterative bias corrected smoothers rhk 
can be related to the spectrum of /— S". The next proposition collects the various results for sequence 
of iterated bias corrected linear smoothers. 

Proposition 1 Suppose that the singular values Xj of I — S satisfy 

0<A, <1 for j = l,...,n. (7) 

Then we have that 

\\h\\ < fh-iW and lim bk = 0, 

A:— >cxD 

lim fhk = Y and lim E[||mfc — m||^|X"] = ncr^ . 

The assumption that for all 7, the singular values < Aj < 1 implies that / — S* is a contraction, 
so that 11(7 — S)Y\\ < \\Y\\. This condition however does not imply that the smoother S is itself 
a shrinkage smoother as defined by Buja et al. [1989]. Conversely, not all shrinkage smoothers 
satisfy condition (7) of the theorem. In Sections 3 and 4, wc give examples of common shrinkage 
smoothers for which Xj > 1, and show numerically that for these shrinkage smoothers, the iterative 
bias correction scheme fails. 
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The proposition indicates that the number of iterations of the bias correction scheme is analogous 
to smoothing parameters of more classical smoothers: For small numbers of iterations, the smoother 
is very smooth, becoming increasingly wiggly as the number of iterations increases, to ultimately 
interpolate the data. Smoothers at either extreme (oversmoothing or interpolating the data) may 
have large prediction errors, and the presumptions is that along the sequence of bias corrected 
smoother smoothers, there will be smoothers that have significantly smaller prediction errors. In 
Section 3, we show no only that this fact holds for thin plate smoothing splines, but that there exists 
smoothers in that sequence that "adapts to the unknown smoothness" of the regression function 
and achieves the optimal rate of convergence. Since standard thin plate spline smoothers are not 
adaptive, this demonstrates the usefulness of iterative bias correction. 



2.5. Data-driven selection of the number of steps 

The choice of the number of iterations is crucial since each iteration of the bias correction algorithm 
reduces the bias and increases the variance. Often a few iterations of the bias correction scheme will 
improve upon the pilot smoother. This brings up the important question of how to decide when to 
stop the iterative bias correction process. 

Viewing the latter question as a model selection problem suggests stopping rules for the number of 
iterations based on Akaike Information Criteria (AIC) [Akaike, 1973], modified AIC [Hurvich et al., 
1998], Bayesian Information Criterion (BIG) [Schwarz, 1978], cross-validation, L-fold cross-validation. 
Generalized cross validation [Graven and Wahba, 1979], and data splitting [Hengartner et al., 2002]. 
Each of these data-driven model selection methods estimate an optimum number of iterations k of 
the iterative bias correction algorithm by minimizing estimates for the expected squared prediction 
error of the smoothers over some pre-specified set /C„ = {1, 2, . . . , M„} for the number of iterations. 

Extensive simulations of the above mentioned model selection criteria, both in the univariate 
and the multivariate settings [Cornillon et al., 2008] have shown that GGV 



kccv = argmin <j logCTj^ - 2 log ( 1 



trace(S'/c) 



is a good choice, both in terms of computational efficiencies and of producing good final smoothers 
and asymptotic results (cf Theorem 2). At each iteration, (t^^ corresponds to the estimated variance 
of the current residuals. 

Strongly related to the number of iteration is the smoothness of the pilot smoother, since the 
smoother the pilot is, the bigger is the number of iteration. This point and the algorithm used to 
select the number of iteration are not developed in this paper but are presented in greater detail in 
the companion paper related to the R-package. However, one has to be sure that the pilot smoother 
oversmooths. We will discuss that point in the simulation part, since it depends on the type of 
smoother (thin plate spline, kernel). 
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3. Iterative bias reduction of multivariate thin-plate splines smoothers 

We study the statistical properties of the iterative bias reduction of multivariate thin-plate spline 
smoothers. Given a smoothing parameter A, the thin-plate smoother of degree vq minimizes 



iin^(y,-/(X,)f + A 
1=1 



E 



n, ■ ■ ■ .id > 

il + ■■■+ id < -^0 



gilA hid 



. dxi 



dx 



(8) 



Thin-platc smoothing splines are an attractive class of multivariate smoothers for two reasons: 
first, the solution of (8) is numerically tractable [sec Gu, 2002], and second, the eigenvalues of the 
smoothing matrix are approximativcly known [see Utrcras, 1988]. 



3.1. Numerical example 

The eigenvalues of the associated smoothing matrix lie between zero and one. In light of proposition 
1, the sequence of bias corrected thin-plate spline smoothers, starting from a pilot that ovcrsmooths 
the data, will converge to an intcrpolant of the raw data. As a result, we anticipate that after some 
suitable number of bias correction steps, the resulting bias corrected smoother will be a good 
estimate for the true underlying regression function. This behavior is confirmed numerically in the 
following pedagogical example of a bivariate regression problem: Figure 1 graphs Wendelberger's 
test function [Wendelberger, 1982] 



m{x,y) 



exp 
exp 



{9x - 2)2 -t- (9y - 2f 



■ exp 



(9a; 1)2 ^ (9?/-hl)2 



49 



10 



.<5£^^£±i5til!)4oxp(-(..-4)^-(9.-7n (9) 



that is sampled at 100 locations on the regular grid {0.05, 0.15, 0.85, 0.95}2. The disturbances 
are mean zero Gaussian with variance producing a signal to noise ratio of five. 



True Function 



Data 





Fig 1. True regression function m{xi,X2) (9) on the square [0,1] X [0,1] used in our numerical examples and a 
sample of size 100 with errors and a sample of 100 points 
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Figure 2 shows the cvokition of the bias corrected smoother, starting from a nearly hnear pilot 
smoother in panel (a). At iteration k = 500 (or 499 iterative bias reduction steps), the smoother 
shown in panel (b) is visually close to the original regression function. Continuing the bias correction 
scheme will eventually lead to a smoother that interpolates the raw data. This example shows the 
importance of suitably selecting the number of bias correction iterations. 



(a) (b) (c) 




Fig 2. TPS regression smoothers from 100 noisy observations from (9) (see Figure 1 ) evaluated on a regular grid on 
[0,1] X [0,1]. Panel (a) shows the pilot smoother, panel (b) graphs the bias corrected smoother after 500 iterations 
and panel (c) graphs the smoother after 50000 iterations of the bias correction scheme. 



3.2. Adaptation to smoothness of the regression function 

Let f2 be an open bounded subset of and suppose that the unknown regression function m 
belongs to the Sobolev space •H('')(rj) = H^''^ where v is an integer such that v > d/2. Let S 
denote the smoothing matrix of a thin-plate spline of order i^o < (in practice we will talic the 
smallest possible value i/q = L'^/^J + 1) and fix the smoothing parameter Ag > to some reasonably 
large value. Our next theorem states that there exists a number of iterations k = k{n), depending 
on the sample size, for which the resulting estimate m/; achieves the optimal rate of convergence. 
In light of that theorem, we expect that an iterative bias corrected smoother, with the number of 
iterations selected by GCV, will achieve the optimal rate of convergence. 

Theorem 1 Assume that the design Xi G 51, i = 1, . . . , n satisfies the following assumption: Define 
hrnaxin) = su-p iuf \x - X.^l, and h„iinin) ^ mill \X, - Xj\, 

and assume that there exists a constant B > such that 

limin yn J 

Suppose that the true regression function m G H^'^\ 

If the initial estimator rhi = SY is obtained with S a thin-plate spline of degree vq, with [d/2] < 
vq < v and a fixed smoothing parameter Aq > not depending on the sample size n, then there is 
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an optimal number of iterations k{n) such that the resulting smoother fhk satisfies 



E 



which is the optimal rate of convergence for m G 'H^'^K 

While adaptation of the ij2-boosting algorithm applied to univariate smoothing splines was proven 
by Biihlmann and Yu [2003] , the application of bias reduction to achieve adaptation to the smooth- 
ness of multivariate regression function has not been previously exploited. Rate optimality of the 
smoother rhk is achieved by suitable selection of the number of bias correcting iterations, while the 
smoothing parameter Aq remains unchanged. That is, the effective size of the neighborhoods the 
smoother averages over remains constant. Selecting the optimal number of iterations is important 
and we prove that result with GCV criterion using Theorem 3.2 of Li [1987]. 



Theorem 2 Let kccv G ~ {Ij • • ■ i L"-^J}j 1^7^ {'^vo)/d, denote the index in the sequence 
of bias corrected smothers whose associated smoother minimize the generalized cross-validation cri- 
teria. Suppose that the noise e in (1) has finite Aq*^ absolute moment, where q > ^{2iy/d+ 1), that 
is, E[|£|^'^] < oo. Then as the sample size n grows to infinity. 



^ — m\\ 

^1, in probability. 



The moment condition is satisfied for Gaussian or subgaussian errors. 



4. Iterative bias reduction of kernel smoothers 



The matrix S of thin plate spline is symmetric and has eigenvalues in (0, 1] [see for example Utreras, 
1988]. In particular, the first Mq = (""i/^i^T^) eigenvalues are all equal to one, corresponding to the 
parametric component of the smoothing spline. The sample size n needs to be at least Mq, and since 
from Theorem 1 we want > d/2, it follows that Mg grows exponentially fast in the number of 
covariates d. In particular the dimension of the parametric component of freedom is 5, 28, 165, 1001 
for d — 4,6,8,10, respectively, and more generally, A/o grows like 3*^/^ x (3/2)'^ for large d. This 
feature limits the practical usefulness of TPS smoothers. For example, the regression model in 
Section 5 for the Boston housing data set that has 13 covariates can not be fit with a TPS because 
its sample size n = 506 < 27500 « A/q. 

A possible resolution to this problem is to approximate the TPS smoother with a kernel smoother, 
with an appropriate kernel [see Silverman, 1984, Messer, 1991]. In this section, we discuss kernel 
based smoothers in general, and we give a necessary and sufficient condition on the kernel that 
ensures that the iterative bias correction scheme is well behaved. We supplement our theorems with 
numerical examples of both good and bad behavior of our scheme. 



^.1. Kernel type smoothers 

The matrix S of kernel estimators has entries Sij = K{dh{Xi, Xj))/^j^ K{dh{Xi, Xj)), where K{.) 
is typically a symmetric function in K (e.g., uniform, Epanechnikov, Gaussian), and dh{x,y) is 
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a weighted distance between two vectors x,y £ M'^. The particular choice of the distance 
determines the shape of the neighborhood. For example, the weighted Euclidean norm 

dh{x,y) = 

where h = (hi, . . . , hd) denotes the bandwidth vector, gives rise to elliptic neighborhoods. 



4-2. Spectrum of kernel smoothers 

While the smoothing matrix S is not symmetric, it has a real spectrum. Write S = DK, where 
IK is symmetric matrix with general element K^j = K{dh{Xi,Xj)) and D is diagonal matrix with 
elements Da = 1/ K{dh(Xi, Xj)). li q is an eigenvector of 5* associated to the eigenvalue A, then 

Sq = DKq = D^/^ (i?l/2Ki?l/2^ ^-1/2^ ^ 

and hence 

Hence the symmetric matrix A ~ D^/^KZ?^/^ has the same spectrum as S. Since S is row-stochastic, 
all its eigenvalues are bounded by one. Thus, in light of Theorem 1, we seek conditions on the kernel 
K to ensure that its spectrum is non-negative. Necessary and sufficient conditions on the smoothing 
kernel K for S to have a non-negative spectrum are given in the following Theorem. 

Theorem 3 // the inverse Fourier- Stieltjes transform of a kernel K{-) is a real positive finite 
measure, then the spectrum of the Nadaraya-Watson kernel smoother lies between zero and one. 

Conversely, suppose that Xi, . . . , X„ are an independent n-sample from a density f (with respect 
to Lebesgue measure) that is bounded away from zero on a compact set strictly included in the 
support of f . If the inverse Fourier- Stieltjes transform of a kernel K{-) is not a positive finite 
measure, then with probability approaching one as the sample size n grows to infinity, the maximum 
of the spectrum of I — S is larger than one. 

Remark 1: The assumption that the inverse Fouricr-Stieltjcs transform of a kernel K{-) is a real 
positive finite measure is equivalent to the kernel A'(-) being positive-definite function, that is, for 
any finite set of points xi, . . . , Xm, the matrix 

/ A'(0) K{dh{xi,X2)) K{dhixi,X3)) ... K{dhixi,x„i)) \ 

K{dh{x2,xi)) K{0) K{dh{x2,X3)) ... K{dh{x2,x„i)) 

\ K{dh{Xrn,Xl)) K{dh{Xra,X2)) K {dhix„i, X3)) ... K {0) J 

is positive definite. We refer to Schwartz [1993] for a detailed study of positive definite functions. 
Remark 2: Di Marzio and Taylor [2008] proved the first part of the theorem in the context of 
univariate smoothers. Our proof of the converse shows that for large enough sample sizes, most 
configurations from a random design lead to smoothing matrix S with negative singular values. 
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Iterative smoothing of the residuals can be coniputationany burdensome. To derive an alternative, 
and computationally more efficient representation of the iterative bias corrected smoother, observe 
that 

mt = [J-Di/2(/_Di/2KDi/2)'=i?-V2]y 
= D^/^[I - {I - A)'']D-^/'^Y. 

Writing A = D^^^KD^^^ — PaA^^P^, where Pa is the orthonormal matrix of eigenvectors and Aa 
diagonal matrix of their associated eigenvalues, we obtain a computationally efficient representation 
for the smoother 

mfc = D'^'Pa[I-{I-Aa)'']PXD-'/^Y. 

Note that the eigenvalue decomposition of A needs only to be computed once, and hence leads to 
a fast implementation for calculating the sequence of bias corrected smoothers. 

The Gaussian and triangular kernels are positive definite kernels (they are the Fourier transform 
of a finite positive measure, [Feller, 1966]). In light of Theorem 3, the iterative bias correction 
of Nadaraya- Watson kernel smoothers with these kernels produces a sequence of well behavior 
smoother. 

The anticipated behavior of iterative bias correction for Gaussian kernel smoothers is confirmed 
in our numerical example. Figure 3 shows the progression of the sequence of bias corrected smoothers 
starting from a very smooth surface (see panel (a)) that is nearly constant. Fifty iterations (see panel 
(b)) produces a fit that is visually similar to the original function. Continued bias corrections then 
slowly degrades the fit as the smoother starts to over-fit the data. Continuing the bias correction 
scheme will eventually lead to a smoother that interpolates the data. This example hints at the 
potential gains that can be realized by suitably selecting the number of bias correction steps. 



(a) (b) (c) 




Fig 3. Gaussian kernel smoother ofm{x\,X2) from n = 100 equidistributed points on [0,1] X [0,1], evaluated on a 
regular grid with (a) k = 1, (b) 50 and (c) 10000 iterations. 

The uniform and the Epanechnikov kernels are not positive definite. Theorem 3 states that 
for large enough samples, we expect with high probability that I — S has at least one eigenvalue 
larger than one. When this occurs, the sequence of iterative bias corrected smoothers will behave 
erratically and eventually diverge. Proposition 2 below strengthens this result by giving an explicit 
condition on the configurations of the design points for which the largest singular value of / — 5 is 
always larger than one. 
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Proposition 2 Denote by Mi the following set: {Xj : K{dh{Xj,Xi)) > 0}. 

If there exists a set Mi which contains (at least) two points Xj,Xk different of Xi such that 
dh{Xi, Xj) < 1, dh{Xi, Xk) < 1 and dh{Xj, Xk) > 1, then the smoothing matrix S for the uniform 
kernel smoother has at least one negative eigenvalue. 

If there exits a set Mi that contains (at least) two points Xj,Xk ifferent of Xi that satisfy 

dh{X,,Xk) > min{dh{X„Xj),dh{X„Xk)}, 

then the smoothing matrix S for the Epanechnikov kernel smoother has at least one negative eigen- 
value. 

The failure of the iterated bias correction scheme using Epanechnikov kernel smoothers is illustrated 
in the numerical example shown in Figure 4. As for the Gaussian smoother, the initial smoother 
(panel (a)) is nearly constant. After five iterations (panel (b)) some of the features of the function 
become visible. Continuing the bias corrections scheme produces an unstable smoother. Panel (c) 
shows that after only 25 iterations, the smoother becomes noisy. Nevertheless, when comparing 
panel (a) with panel (b), we see that some improvement is possible from a few iterations of the bias 
reduction scheme. 



(a) (b) (c) 




Fig 4. Epanechnikov kernel smoother of m{x\, X2) from n = 100 equidistributed points on [0, 1] X [0, 1], evaluated on 
a regular grid with (a) k = 1, (b) 5 and (c) 25 iterations. 



5. Simulations and a real example 

This section presents the results of a modest simulation study to compare the empirical mean 
squared error 

1 " 

MSE = -y^(m(X^) -miXiff (10) 

i=l 

of our procedure to its competitors for two functions, in dimensions d = 3, 5, 7 and sample sizes 
n = 50,100,200,500,800, with a noise to signal ratio of 10%. In order to exploit our theoretical 
result, the pilot smoother has to oversmooth otherwise the pilot smoother will have no bias and 
our iterative debiasing procedure has no more justification. So starting with a small A will lead to 
zero or a small number of iterations. Oppositely, starting with a big A will normally lead to a large 

imsart-generic ver. 2009/08/13 file: pacnheml.tex date: January 21, 2013 



/Recursive Bias Estimation 



13 



number of iterations. We decide in this section to use the values by default in the ibr R-package. The 
thin plate spline is govern by a single parameter A that weights the contribution of the roughness 
penalty. For estimating a d- valued regression function, the parametric component is Mq = {'^"J^t'i^) 
and we choose A such that the initial degree of freedom of the pilot smoother equals equals I.SMq. 
The implementation for the kernel smoother is different since we could choose a different bandwidth 
for each explanatory variables. We choose one bandwidth for each explanatory variable X, such as 
the effective degree of freedom for the one-dimensional smoothing matrix related to Xi has a trace 
equal to 1.1 (more degree than a constant but less than a linear model). For such values, the pilot 
smoothers always oversmooth. 

Our simulations was designed to allow us to investigate three aspects: First, compare the perfor- 
mance of the thin plate spline with smoothing parameter selected by GCV with the IBR smoother 
using a thin plate spline with a large smoothing parameter. We expect that adaptation of our 
method will translate into a better performance of our smoother over the optimal TPS smoother. 
Second, to compare the performance between IBR smoother using either TPS and kernel based 
smoothers. Since kernel smoothers do not have a parametric component (which may, or may not, 
be needed to fit the data), we believe that kernel smoothers use more effectively their degree of 
freedom, which translates into better performance. Third, we want to compare the performance of 
fully nonparametric smoothers and additive smoothers. While with additive models we estimate an 
approximation of the true regression function, it is generally believed that the approximation error 
of an additive model is smaller than the estimation error of a fully multivariate smoother even for 
dimensions for small sample sizes, e.g. n = 50, 100, and moderate dimensions of the covariates, e.g. 
d = 5. The results of our study are summarized in Table 5 and Figure 5. 

Figure 5 shows nine panels each containing the boxplots of the MSB from 500 simulations, 
on a logarithmic scale on the y-axis. Moving from top to bottom ranges the regression func- 
tions from the function of three variables sin(27r(xij;2)"'^/^) + cos(27r(a;2X3)^/^), to the function 
of five variables sin(27r(a;ia;2a;3)^/^) -I- cos(27r(a;3X4a;5)^/'^) and to the function of seven variables 
sin(27r(a;ia;2CC3a;4)"'^/^) -I- cos(27r(a;4a;5a;6a;7)^/^). All the covariates are i.i.d. uniforms on the interval 
(1, 2). Moving from left to right changes the sample size from n = 50, 200, 800. Within each panel, 
the boxplot of MSE is shown, in the order from left to right, of additive models using the function 
gam from the R package mgcv , TPS with optimal smoothing parameter using the function Tps 
from the R package fields, iterative bias reduction with TPS smoother using the function ibr from 
the ibr R package and iterative bias reduction with kernel smoothers, using again the ibr function. 
For reasons explained in Section 4, no TPS smoothers can be evaluated for the d = 7, n = 50 panel. 

Figure 5 shows that a fully nonparametric smoother is always preferred to an additive smoother, 
even for relative small sample sizes and moderate dimensions. 
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gam tps ibr— tps ibr— k 



1 1 1 1 

gam tps ibr— tps ibr— k 



gam tps ibr— tps ibr— k 



^3 



-f- 



gam tps ibr— tps ibr— k 



1 1 1 1 

gam tps ibr— tps ibr— k 



gam tps ibr— tps ibr— k 



-f- 



I ' r- 

I I 



jam tps ibr— tps ibr— k 



I I I I 

gam tps ibr— tps ibr— k 



gam tps ibr— tps ibr— k 



Fig 5. Boxplot of Mean Squared Error (MSE) of smoothers for the regression functions (from top to bottom) of three 
variables sin{27r(3;i3;2)"'"^'^) + cos{2it{x2Xz)^^'^) , five variables sin(27r{a;ia:2X3)-'-/^) + cos(27r(x3X4a;5)-'-/^) and seven 
variables s\n{2-K(xiX2X2,X4)^^'^) + cos(27r(x4X5X6X7)^/*), and of sample size (from left to right) of n = 50,200,800. 
Each panel shows the boxplot of the MSE of a GAM smoother, TPS smoother, IBR with TPS smoother and IBR 
with kernel smoother. 

In extensive simulations, to be reported in a follow-on paper, we observe that this qualitative 
conclusion holds over a wide variety of regression functions. Generally, as expected, the TPS with 
optimal smoothing parameter has a somewhat worse performance than the TPS IBR smoother. 
And finally, the kernel based IBR smoother is slightly better than the TPS based IBR smoother, 
especially in higher dimensions. 

Table 5 gives further insight into the performance of the various smoothers. Our table presents 
the ratio of the median MSE (in 500 simulation runs) of various smoothers to the median MSE of 
the kernel based IBR smoother. Since all the entries are larger than one, we conclude that kernel 
based IBR consistently outperforms the other smoothing procedures over the range of sample size, 
number of covariates and regression functions we considered in our study. 
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function. 




gam 


tps 




ibr-k 




50 


2.59 


1.63 


1.39 


I 




100 


4.59 


1.89 


1.58 


1 




200 


8.38 


2.14 


1.73 


1 




500 


17.9 


2.56 


2.08 


1 




800 


27.4 


2.82 


2.39 


1 




50 


6.72 


1.70 


1.09 


1 




100 


12.0 


1.80 


1.19 


1 




200 


22.3 


1.91 


1.27 


I 




500 


46.2 


1.99 


1.45 


1 




800 


67.3 


2.04 


1.51 


1 




50 


2.16 


1.60 


1.47 


I 




100 


3.83 


1.42 


1.39 


1 




200 


6 64 


1 28 


1 24 


^ 




500 


13.17 


1.24 


1.22 


1 




800 


19.44 


1.26 


1.23 


1 




50 


3.62 


1.26 


1 


1 




100 


6.32 


1.76 


1.15 


1 


sin(27r(2;i2:2^3)"^^^) + cos(27r(a:3a:4a:5)^'''^) 


200 


10.0 


1.95 


1.31 


I 




500 


18.6 


2.06 


1.38 


1 




800 


26.5 


2.18 


1.46 


1 




50 


2.05 






1 




100 


3.11 






1 


XiX2X3XiX5X6Xr 


200 


5.26 


3.53 


3.17 


1 




500 


9.85 


2.46 


2.45 


1 




800 


13.8 


2.07 


2.07 


1 




50 


3.16 






1 




100 


4.38 






1 


sin(27r(a:i2;2a^32;4)"^^'^) + cos{2-k{x4X5XqX7)^^^) 


200 


6.43 


1.78 


1.57 


1 




500 


11.1 


1.37 


1.31 


1 




800 


14.9 


1.27 


1.22 


1 



Table 1 

Ratio of median MSB over 500 simulations of a smoother and the median MSB over 500 simulations of the kernel 
based IBR smoother. The smoothers, from left to right, are Generalized Additive Model ( GAM), TPS with 
optimally selected smoothing parameter (tps), TPS based IBR (ibr-tps) and kernel based IBR (ibr-k). 

The improvement over a GAM model ranges from 100% to 6000%. This reinforces our conclusions 
that fully nonparametric regressions are practical for moderately large number of covariates, even 
for sample sizes as small as n = 50. 

The other notable observation is that the values in the ibr-tps column are always less than those 
in the tps column, showing that consistently, the TPS based IBR smoother has better performance 
than TPS with optimal smoothing parameter. In our simulation study, the typical improvement is 
of 20%. 



5.1. Boston housing data 

We apply our method on the Boston housing data. This dataset, created by Harrison and Rubinfeld 
[1978] has been extensively to showcase the performance and behavior of nonparametric multivariate 
smoothers, sec for example Breiman and Friedman [1995] and more recently by Di Marzio and Taylor 
[2008]. The data contains 13 explanatory variables describing each of 506 census tracts in the Boston 
area taken from the 1970 census, together with the median value of owner-occupied homes in SlOOO's. 
The sample size of the data is n = 506 and the number of explanatory variables d = 13. 

We compare our method with the MARS algorithm of Friedman [1991] as implemented in the R 
package mda, with projection pursuit regression (function ppr), additive models using the backfit- 
ting algorithm of Hastie and Tibshirani [1995] as implemented in the R package mgcv, and additive 
Boosting Biihlmann and Yu [2003] from the R package mboost. 



imsart-generic ver. 2009/08/13 file: pacnheml.tex date: January 21, 2013 



/Recursive Bias Estimation 



16 



The predicted mean squared error is estimated by randomly splitting 30 times the data into 
training sets (size n = 350) and testing sets {n = 156). We summarize the results of our analysis in 
the following table: 

Table 2 

Predicted mean Squared Error on test observations for Boston housing data. 



Method 


Mean Predicted Squared Error 


Multivariate regression 


20.09 


1/2 Boost with component- wise spHne 


9.59 


additive model (backfitted with R) 


11.77 


Projection pursuit (with R) 


12.64 (4) 


MARS (with R) 


10.54 


IBR with GCV stopping rule 

and multivariate Gaussian kernel with 

1.1 initial DDL per variable and 1230 iterations 


7.35 



Table 2 again supports our claim that the fully multivariate method presented in the paper leads 
to a reduction of more than 30% in the prediction mean squared error over competing state-of- 
the-art multivariate smoothing methods. A similar comparison for responses on the logarithmic 
scale reveals the even larger reduction of 40% in the prediction mean squared error. Since our fully 
nonparametric regression smoother has substantially smaller prediction error over additive linear 
models and low-order interaction models, we conclude that there exist higher order interactions in 
that data that are significant. 

6. Conclusion 

This paper introduces a fully multivariate regression smoother for estimating the regression function 
m{Xi, . . . , Xd) obtained by successive bias correction from a very smooth (biased) pilot smoother. 
We show that the resulting smoother is adaptive to the underlying smoothness (see theorems 1 
and 2). This adaptation to the underlying smoothness partially mitigates the effect from the curse 
of dimensionality in many practical examples, and make it practical to use fully nonparametric 
smoother in moderate dimensions, even for smaller sample sizes. 

As in L2 boosting, the proposed iterative bias correction scheme needs a weak learner as a base 
smoother S, but all weak learners are not suitable (see theorem 1). For instance, Epanechnikov 
kernel smoothers are not interesting (see Theorem 3). We further note that one does not need to 
keep the same smoother throughout the iterative bias correcting scheme. We conjecture that there 
are advantages to using weaker smoothers later in the iterative scheme, and shall investigate this 
in a forthcoming paper. 

Finally, the R package ibr available at CRAN implements the proposed multivariate nonpara- 
metric method in R. 
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Appendix 



Proof of Proposition 1 

= w-ii -s)''-'SY\\^ 

= \\{i- s){i s)'-'sYr <!!(/- s)r\\b,^,r 
< ii^fc-if, 

where the last inequality follows from the assumptions on the spectrum of / — S*. 

Proof of Theorem 1 Let lyQ < v and fix the smoothing parameter Aq. Define S = Si^g^Xg. The 
eigen decomposition of S [Utreras, 1988] gives 

Al = • ■ • = Xn.r = 1 and ^ t-t < A,- < 7-7, 

where (d+i"o~i) and aj and /3o are two positive constants. We decide to simplify the notation using 



1 + Aoj2''o/d 
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Let us evaluate the variance of the estimator: 

2A/0 , 



1 



1 + Xof^o/'^' 



V{m,,Xo,uo) « a'^ + ^- ^ 1 - (1 

j=Ma + l L ^ 

Choose Jn in j ~ Mq, . . . and split the sum in two parts. Then bound the summand of the first 
sum by one to get 

2 



\/(mfe, Ao, I'o) < cr ho- \ > 

n n n ^ — ' 

As the function 1 — (1 — 71)*^ < fcu for u G [0, 1], we have 



1 + Aoj^^o/rf 



J 2 " / 1 



\2vQld 



< 



T 2 " 1 



j=J„ + l OJ 



Bounding the sum by the integral and evaluate the latter, one has 

'2 1 



j-ivo/d+l 



n n A2(4z/o/d - 1) 

If we want to balance the two terms of the variance, one has to choose the following number of 
iterations A'„ — 0{Jn'^°^'^). For such a choice the variance is of order 



O 



Let us evaluate the squared bias of m^. Recall first the decomposition of S,^g^\g — P^^^KPl^^ and 
denote by Hj^vo = Kolj"^ the coordinate of m in the eigen vector space of S,j^^^Xa- 

1 " 

j = A/o + l 3=jr^ + l 

If m belongs to U'^"^ it belongs to U'^""'^ and we have the following relation by property of 'HS^"' 



(11) 



Using the fact that Xj > 0, we have: 



\ \ ^ \ " ■-2i//d ■2iy/d,,2 I ^ \ " ■-2^'/d ■2i//d,,2 

j=Mo + l J=i„ + 1 



i=A/o + l 
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Using the same type of bound as in equation (11) we get 
Thus the bias is of order 0{jn^'^^'^)- 

Balancing the squared bias and the variance lead to the choice 
and we obtain the desired optimal rate. 



Proof of Theorem 2 We show that conditions (A.l) to (A. 7) given by Li [1987], in theorem 3.2 are 
satisfied. To make the proof self contained, we recall briefly these conditions: (A.l) lim„^oo supi^^^^ A(S'fc) < 
oo, {A.2) Eie^"") < oo, (A3) EfceK„ ("^»(^))"" ^ 0' 

where i?„(fc) =E(|| |P)/n, (A. 4) : inffegK;„ n ^\\ifik — m\\^ 0, in probability. 

{A.5) for any sequence {fc„ G /C„} such that n~^trace(S'fc^S'^^) — we have 

{n^^tracc(S'fc„)}^/{n~-^trace(S'fc„S'[. )} — > 0, {A.6) supj.^^ n~^tracc(S'fc) < 71 for some 1 > 71 > 
0, 

{A.7) sup;,gji^ {n"Hrace(S'fc)}^/{n"Hrace(S'fcS'^)} < 72 for some 1 > 72 > 0. 
Conditions (A.l) to (A.4) 

The eigen values of Sk (denoted as X{Sk)) are between and 1 Vn, thus the condition (A.l) is 
fulfilled. To fulfill condition (A. 3) we need to calculate J^keic ''T'Rn{k)~™' , where m is an integer 
to be found, m„ = {m{Xi), . . . ,m(X„))' and rhk^n = SkY. Using Theorem 1 we have that for an 
optimal choice of fc, Rn{k) = 0{n'^/'^'^^^'^^). Let us choose /C„ such that its cardinal is of order "n? 
(1 < 7 < (2i'o)/'i), we get the order of an upper bound of 'YlikeK^^n{k)~™ is ri'~ 2"+^ . To have 
(A.3) fulfilled we need that 7 - < 0, that is m > ^{2v/d+ 1). Condition (A.4) is satisfied 

because of Theorem 1. 



Conditions (A.5) to (A.7) are related the trace of the matrix Sk and of Sf.. Let us recall first 
some general remarks 



-trace(Sfe) - -\Mq+ [l-(l-A,)'=] 



where the eigen values \j are less than 1, bigger than and decreasing. So trace(S'fe) and trace(S'^) 
are increasing with k. By proposition 1, Vmik^^ao (^trace(S'fc„)) = 1 and the same is true for 
trace(S'^ ) so the choice of the maximal value of fc„ is important as it will be emphasize in the 
proof. The last general remark is the following 

1 \^ 1 1 

— traceS'fc ) < — traceS*? < — traceS'fc < 1. 
n J n n 

Thanks to Utreras [1988], we know that 

A, ~ r — : — , ao = — > 1- 
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Let us write 



So we have 



1 k 



= (1 + V'J 



— trace(5'fe) 
n 



Imo + I ^ (i-[i + A„-\r-]" 

1 " 
-Mo+ V .gfe(j)- 

n ^ — ^ 



i=Mo + l 



We can write 



l-[l + Ao-ij-"«]-' 
l-cxp[-fcln(l + A„-ij--"'')] 
l-cxp[-fcA„ij--""]. 



Let us consider the case where j„ tends to infinity. We want to ensure the following condition: 
— fcnj""" tends to zero. Since kn = "n? consider e > such that e < ao — 7 and assume that 

then -fcAg even when fc is at maximum rate of order . Thus when n grows to infinity, 

V.7 > jn we have the following approximation for gkij)'- 



(12) 



Order of an upper bound of trace(S'fc)/n 
V/c G /C„, we have when n grows to infinity: 



1 AI 1 ]^ ^ 

-trace(5'fc) « — + - V 9k{j) + - V fffc(j) 
" n n ^ — ' " ^ — ' 



< 



< 



Jr. 



1 



n n 
jn , kj 



9kU)dj 



l-Qn 



n n(ao ^ 1) 



-An 



with the last approximation which follows from equation (12). Using the fact that the maximum 
rate for fc„ is 0{n'') and that (7, e) arc chosen such that 7 + e < ao wc have that an upper bound 

of tracc(S'A;)/n is of order of n "o 



Order of a lower bound of trace(S'^)/n 
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Vfc e /C„ , we have when n grows to infinity: 



— trace (5'j.) 
n 



Mo 
n 



In 



> 



1 

n 

Mo , glUn) 



1 



3=Ma+l 3=jn,+l 
n+1 



glU) 



n 

Mo 
n 



n 



(jn - Mo) + 



1 



71 



Jn 



2an 



2ao-l 



with the last approximation which foUows from equation (12). Using one more time equation (12) 
we get that a lower bound of itracc(S'^) if of order of k^n'^~^~'^^~^'' . 

Condition (A. 5) 

Using the previous calculated order, we get that the order of (trace(S'fc)/n)^(trace(5^)/7i)~-^ is of 



order k ^ 



n "0 



Recall that (7, e) are chosen such that 7 + e < ao, thus provided that e < 1 



the condition (A. 5) of Li's theorem are fulfilled. 
Condition (A.6) 

The n eigcn values are decreasing. Consider now, j„ — nC, with Q fixed and less than one. We have 
that the maximal value of the mean of the trace which occurs at A:„ = rC is bounded by 



1 



trace(S'fc) < 



jn , {n-jn) 



h i~' 
'^nJn 



We can easily show that the last quantity is less than a given value smaller than 1. The aim of 
setting k„ equal to n"' is to ensure that at the border of grid /C„; the smoother is not identity 
(ie interpolating). When the smoother is too close to the identity matrix, conditions A.6 and A. 7 
arc not fulfilled anymore. Moreover, being very close to identity is not interesting in a statistical 
viewpoint. 



Condition (A. 7) 

We want to analyze 



Denote by 



so we have 



sup 



aj{k) 



— traceS'fc 
n 



— traces'? 
n 



(n ^trace(S'fe))^ 
n'Hrace(5'^) 

= ln(l-(l-A,)'=), 

i^expaj(fe) 

j 

— exp 2aj{k) 
3 
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Let us show that the ratio is an increasing sequence in k. Let us evaluate the sign of the derivative 
of the ratio. We want to evaluate the sign of 

^^exp Uj (k) 'y^a'j (k) cxp aj (fc) ^^cxp 2a j (fc)— ^^exp aj (fc) ^^exp aj (fc) ^^Q^j- (k) exp 2a j (fc). 
Simplifying and dividing by ^ exp Q!j(fc) ^ exp 2Q!j(fc) leads to 

^E«j (fc)cxpaj(fc) _ cxp2aj-(fc) 
^Eexpaj (fc) i exp 2aj (fc) 

Rewrite as 

The {I3j{k)}j is an increasing sequence bounded by one. It is possible to show that 13 is positive 
using induction. The maximum of the quantity under consideration grid JCn is obtained at the bor- 
der of the grid. Condition (A. 7) can be shown to be fulfilled by using upper bound of trace(S'fe)/n 
and lower bound trace(5'^)/n. 

Proof of Theorem 3 For notational simplicity, we present the proof in the univariate case. Let 
Xi, . . . , Xn is an i.i.d. sample from a density / that is bounded away from zero on a compact set 
strictly included in the support of /. Consider without loss of generality that f{x)>c>0 for all 
|x| < b. 

We are interested in the sign of the quadratic form u'Au where the individual entries Aij of 
matrix A arc equal to 

^ ^ K,,{X,-X,) 

VEi K^,{X., - Xi)^Y.i Kh{X, -Xi) 

Recall the definition of the scaled kernel Kh{-) ~ K{-/h)/h. If v is the vector of coordinate Vi = 
Ui/ y/^i Kii{Xi — Xi) then we have u'Au = v'Kv, where K is the matrix with individual entries 
Kfi{Xi — Xj). Thus any conclusion on the quadratic form w'Kw carry on to the quadratic form 
u'Au. To show the existence of a negative eigenvalue for K, we seek to construct a vector U = 

{Ui{Xi), . . . , Un{Xn)) for which we can show that the quadratic form 



n n 



U'KU = J2Y. Uj{X,)Uk{Xk)Kh{Xj - Xk) 
j=i k=i 

converges in probability to a negative quantity as the sample size grows to infinity. We show the 
latter by evaluating the expectation of the quadratic form and applying the weak law of large 
number. Let f{x) be a real function in L2, define its Fourier transform 



ip{t) = J er^"'''ip{x)dx 

and its Fourier inverse by 
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For kernels K(-) that are real symmetric probability densities, we have 

Kit) = k„,,{t). 

From Boclmcr's theorem, we know that if the kernel K{-) is not positive definite, then there exists 
a bounded symmetric set A of positive Lcbesgue measure (denoted by |^|), such that 



k{t) < Vt G ^. 



(14) 



Let (^(t) G -/j2 be a real symmetric function supported on A, bounded by B (i.e. |<^(t)| < B). 
Obviously, its inverse Fourier transform 



ip{x) 



-2Tzixt 



(p{t)dt 



is intcgrable and by virtue of Parscval's identity 

Using the following version of Parseval's identity [see Feller, 1966, p. 620] 



— CXD ^ — OO 



OO noo 



ipix)ip{y)K{x-y)dxdy^ / \lpit)\^ K{t)dt, 
which when combined with equation (14), leads us to conclude that 

ip{x)(p{y)K{x — y)dxdy < 0. 



^d^ii\x,\<b) 
f<^ii\x,\<b) 



OO — CJO 



Consider the following vector 



U 



£<f^li\X„\<b) J 
With this choice, the expected value of the quadratic form is 

E[g] 



E 

1 

n 



J2 Uj{X,)Uk{Xu)Kn{X, ~ Xu) 

j,k=l 

^is/h)^K,,iO)ds 



b i-b 



bJ-b 



—rip(s/h)ip{t/h)Kh(s ~ t)dsdt 



= h+h- 
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We bound the first integral 



ir = mf^ds 

-b 

b/h 

< — 'J±_L j Loiuf'du 
ncn ' 



nh^ J-b fis) 
KhiO) 



< 



-b/h 

B^\A\K{0) _i 
ch^ " • 



Observe that for any fixed value h, the latter can be made arbitrarily small by choosing n large 
enough. We evaluate the second integral by noting that 



l--')h-'^[ I ip{s/h)tp{t/h)Kh{s -t)dsdt 



n 



-bJ-b 



2 \ /•''/'l /•''/'» 

1 ]h^^ / (piu)(p{v)K{u - v)dudv. (15) 

J -b/h J -b/h 

By virtue of the dominated convergence theorem, the value of the last integral converges to 
jTao K{t)dt < as ft- goes to zero. Thus for h small enough, (15) is less than zero, and 

it follows that we can make IE[(3] < by taking n > jiq, for some large uq. Finally, convergence in 
probability of the quadratic form to its expectation is guaranteed by the weak law of large numbers 
for ?7-statistics [sec Grams and Serfling, 1973, for example]. The conclusion of the theorem follows. 



Proof of Proposition 2 To handle multivariate case, let each component hj of the vector h 
be larger than the minimum distance between three consecutive points, and denote by dh{Xi, Xj) 
the distance between two vectors. For example, if the usual Euclidean distance is used, we have 

The multivariate kernel evaluated at Xi, Xj can be written as K{dh{Xi, Xj)) where K is univariate. 
We are interested in the sign of the quadratic form u'Ku (see proof of theorem 3). Recall that if 
IK is semidefinite positive then all its principal minor [see Horn and Johnson, 1985, p. 398] are 
nonnegative. In particular, we can show that A is not semidefinite positive by producing a 3 x 3 
principal minor with negative determinant. To this end, take the principal minor IK[3] obtained by 
taking the rows and columns (ii, j2,*3)- The determinant of IK [3] is 

det(K[3]) = K{dhm[K{dH{Q)f -K{dh{X,,,X,,)f] 
-K{dh{X,„X,,)) X 

[Kidh iO))K{dh{X,, , X, J) - K{dh{X,, , X,,))K{dh{X,,^ , X,J)] 
+K{dh{X,„X,J) X 

[K{dh{X,„X,,))K{dh{X,„X,,)) - K{dhmK{dh{X,„X,,))] . 
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Let us evaluate this quantity for the uniform and Epancchnikov kernels. 

Uniform kernel. Choose 3 points in with index ii,i2,«3 such that 

dh{Xi^,Xi^) < 1, dh{Xi2,Xi^) < 1, and dh{Xi^,Xi^) > 1. 
With this choice, we readily calculate 

dei(]K[3]) = - A',,(0) [Kh{Of - O] - < 0. 
Since a principal minor of IK is negative, we conclude that K and A arc not scmidcfinitc positive. 

Epanechnikov kernel. Choose 3 points {Xi}^^^ with index «i,?2,*3, such that dh{Xi^, Xi^) > 
iam{dh{X,^, X,^); dhiX,^, X^^)) and set dhiX,^,Xi^) = x <1 and dh{Xi^,Xi^) = y <1. 
Using triangular inequality, we have 

det{K[3]) < 0.75(0.75^ - K{y)^) - K{x){0.75K{x) - K (y) K {nim{x , y))) 
-K{mm{x, y))K{x)K{y) - 0.75K{x + yf 

The right hand side of this equation is a bivariate function of x and y. Numerical evaluations of that 
function show that small x and y leads to negative value of this function, that is the determinant 
of K[3] can be negative. 




0.0 0,2 0.4 0.6 0.8 1,0 



Fig 6. Contour of an upper bound of det{M.[Z\) as a function of{x,y). 

Thus a principal minor of K is negative, and as a result, K and A are not semidefinite positive. 
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